As the name implies, we want to reduce a set of several variables into a few variables. In this session, we will practice two basic techniques:
Cluster analysis.
Factor analysis.
Simply speaking, cluster analysis organizes the cases (rows) into a small set of groups, based on the information (the columns) available for each case. The result is a new variable of categorical type, generally ordinal.
Let me bring back the data we prepared in Python:
# clean memory
rm(list = ls())
# link to data file
link='https://github.com/EvansDataScience/CTforGA_integrating/raw/main/demfragiso_expo.RDS'
# a RDS file from the web needs:
myFile=url(link)
# reading in the data:
fromPy=readRDS(file = myFile)
# reset indexes to R paradigm:
row.names(fromPy)=NULL
# check data types
str(fromPy)
## 'data.frame': 141 obs. of 25 variables:
## $ Countryname : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Officialstatename : chr "The Islamic Republic of Afghanistan" "The Republic of Albania" "The People's Democratic Republic of Algeria" "The Republic of Angola" ...
## $ InternetccTLD : chr ".af" ".al" ".dz" ".ao" ...
## $ iso2 : chr "AF" "AL" "DZ" "AO" ...
## $ iso3 : chr "AFG" "ALB" "DZA" "AGO" ...
## $ Regimetype : Ord.factor w/ 4 levels "Authoritarian"<..: 1 3 1 1 3 2 4 4 1 1 ...
## $ Overallscore : num 0.32 6.11 3.77 3.37 6.81 5.49 8.9 8.07 2.68 2.52 ...
## $ Electoralprocessandpluralism: num 0 7 3.08 1.33 9.17 7.5 10 9.58 0.5 0.42 ...
## $ Functioningofgovernment : num 0.07 6.43 2.5 2.86 5 5.71 8.57 6.79 2.5 2.71 ...
## $ Politicalparticipation : num 0 4.44 4.44 5 7.22 6.11 7.78 8.89 2.78 3.33 ...
## $ Politicalculture : num 1.25 5.63 5 5 5 3.13 8.75 6.88 5 4.38 ...
## $ Civilliberties : num 0.29 7.06 3.82 2.65 7.65 5 9.41 8.24 2.65 1.76 ...
## $ Total : num 8.99 4.48 6.01 7.62 3.55 ...
## $ X1ExternalIntervention : num 8.3 6.1 3.7 4.6 4.3 7 0.5 0.5 7.2 5.3 ...
## $ C1SecurityApparatus : num 10 4.8 6 7.2 4.9 5.7 2.7 1.6 6.4 5.9 ...
## $ P2PublicServices : num 9.8 4.4 5.6 9.3 4.8 3.9 2.8 2.3 5.5 3.5 ...
## $ E3HumanFlightandBrainDrain : num 7 8.3 5.5 6 3 6.8 0.5 1.6 4.3 3 ...
## $ E2EconomicInequality : num 8.1 2.9 5.6 8.9 4.9 3.6 1.8 2.3 5.1 5.3 ...
## $ S2RefugeesandIDPs : num 8.8 2.6 6.8 5.9 1.9 6.6 2 4.4 6.9 1.7 ...
## $ C3GroupGrievance : num 7.2 4.1 7.2 8.1 3.8 5.3 3.1 3.9 6.1 9.6 ...
## $ P1StateLegitimacy : num 8.7 5.5 7.8 8.2 4 6.9 0.5 0.6 9.1 8 ...
## $ C2FactionalizedElites : num 8.6 6.2 7.5 7.2 2.8 7 1.7 3.2 7.9 7.6 ...
## $ S1DemographicPressures : num 9 4.1 4.8 9 5.3 4.4 2.9 3.4 4.2 4.1 ...
## $ E1Economy : num 9.2 6.4 6.8 8.4 7.1 6.6 1.6 1.8 4.7 4.1 ...
## $ P3HumanRights : num 7.4 3.6 6.3 6.2 3.3 6 1.7 0.5 7.7 8.6 ...
We have several countries and several columns. In clustering, we try to create groups with the highest homogeneity within groups and the highest heterogeneity between groups. The variables serve to compute a distance between the cases, which the clustering algorithms use to detect the homogeneity and heterogeneity just mentioned.
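The distance idea can be sketched in base R with a toy (hypothetical) dataset, unrelated to the session data:

```r
# Toy data: three hypothetical "countries" measured on two numeric indicators
toy <- data.frame(indA = c(1, 2, 9),
                  indB = c(1, 2, 8),
                  row.names = c("c1", "c2", "c3"))
# Euclidean distance between every pair of rows:
round(as.matrix(dist(toy)), 2)
# c1 and c2 are close (candidates for one group); c3 is far from both
```

Note that `dist()` handles only numeric data; the `daisy()` function used below computes a Gower distance that also accommodates mixed variable types.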
Here you should subset the data: for this case, keep only the columns with numeric values, dropping the categorical columns and the summary scores, and rename the row index with the country names:
# select variables
dataToCluster=fromPy[,-c(1,2:7,13)]
#save the country names as the row index
row.names(dataToCluster)=fromPy$Countryname
set.seed(999)
library(cluster)
# Gower distance handles mixed (numeric and categorical) variables
distanceMatrix=daisy(x=dataToCluster, metric = "gower")
# project the distances into two dimensions for plotting
projectedData = cmdscale(distanceMatrix, k=2)
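What cmdscale() does can be sketched on toy numbers (hypothetical data, not the session's): it turns a distance matrix into k coordinates per case, trying to preserve the pairwise distances.

```r
# toy points that already live in two dimensions
toy <- data.frame(indA = c(1, 2, 9), indB = c(1, 2, 8))
proj <- cmdscale(dist(toy), k = 2)   # one (x, y) pair per row
dim(proj)                            # 3 cases, 2 coordinates
# for genuinely 2-D input the pairwise distances are preserved exactly:
max(abs(dist(proj) - dist(toy)))     # ~0
```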
The object projectedData stores two coordinates for each case in the data:
# save coordinates to original data frame:
fromPy$dim1 = projectedData[,1]
fromPy$dim2 = projectedData[,2]
# see some:
fromPy[,c('dim1','dim2')][1:10,]
## dim1 dim2
## 1 0.40766940 0.002254166
## 2 -0.06708838 -0.024849688
## 3 0.10496448 0.048467522
## 4 0.21713312 -0.019916926
## 5 -0.12747131 -0.025454222
## 6 0.02232614 -0.016713408
## 7 -0.38544488 0.015037246
## 8 -0.32517748 -0.003159269
## 9 0.13881726 0.121031763
## 10 0.08468431 0.202137453
library(ggplot2)
base= ggplot(data=fromPy,
aes(x=dim1, y=dim2,
label=Countryname))
base + geom_text(size=2)
Can you see some groups emerging?
An alternative is the dendrogram:
# prepare hierarchical cluster
hc = hclust(distanceMatrix)
# very simple dendrogram
plot(hc,hang = -1,cex=0.5)
There are several techniques for clustering, each with pros and cons. We will review hierarchical clustering here. As the name implies, it builds a hierarchy of all possible clusterings, so it is up to us to identify how many clusters emerge; the dendrogram helps with that. The hierarchical approach has two different strategies. The agglomerative approach starts by considering each case (row) its own cluster and then, using the distances, links the individual cases. Not all cases are linked during the first step; each subsequent step merges more clusters, and so on. The linking process can also vary (we will use Ward linkage). The divisive approach, on the other hand, starts by considering all the cases as one cluster and then splits them.
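The agglomerative idea can be sketched with base R alone on toy (hypothetical) points: hclust() merges the closest clusters step by step, and cutree() cuts the resulting tree at a chosen number of groups.

```r
# two tight pairs of toy points
toy <- data.frame(x = c(1, 1.2, 5, 5.1),
                  y = c(1, 1.1, 5, 5.2))
hcToy <- hclust(dist(toy), method = "ward.D2")  # Ward linkage
groups <- cutree(hcToy, k = 2)                  # ask for two clusters
groups
# cases 1-2 land in one cluster and cases 3-4 in the other
```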
Using the function fviz_nbclust from the library factoextra, we can see how many clusters are suggested.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(dataToCluster,
hcut,
diss=distanceMatrix,
method = "gap_stat",
k.max = 10,
verbose = F,
hc_func = "agnes")
fviz_nbclust(dataToCluster,
hcut,
diss=distanceMatrix,
method = "gap_stat",
k.max = 10,
verbose = F,
hc_func = "diana")
We could accept the suggested number of clusters or not. Let’s use the suggestion.
Here you can compute the clusters. I am using both the agglomerative and the divisive approaches. Notice that you have to indicate the required number of clusters a priori.
NumberOfClusterDesired=4
#library(factoextra)
res.agnes= hcut(distanceMatrix,
k = NumberOfClusterDesired,
isdiss=TRUE,
hc_func='agnes',
hc_method = "ward.D2")
# Hierarchical technique- divisive approach
res.diana= hcut(distanceMatrix,
k = NumberOfClusterDesired,
isdiss=TRUE,
hc_func='diana',
hc_method = "ward.D2")
3.1 Save results to the original data frame:
fromPy$agn=as.factor(res.agnes$cluster)
fromPy$dia=as.factor(res.diana$cluster)
3.2 Verify ordinality in clusters: Each cluster has a label. If the result represents some ordering, the labels do not necessarily reflect it. Below, I use the column Overallscore to get an idea of the real ordering the labels represent.
aggregate(data=fromPy,
Overallscore~agn,
FUN=mean)
## agn Overallscore
## 1 1 3.232683
## 2 2 6.193333
## 3 3 8.196061
## 4 4 2.575625
aggregate(data=fromPy,
Overallscore~dia,
FUN=mean)
## dia Overallscore
## 1 1 2.900652
## 2 2 7.014783
## 3 3 5.346818
## 4 4 8.216071
You could recode these values so that the labels represent an ascending order.
library(dplyr)
fromPy$agn=dplyr::recode_factor(fromPy$agn,
`4`='1',`1` = '2',`2`='3',`3`='4',.ordered = T)
fromPy$dia=dplyr::recode_factor(fromPy$dia,
`1` = '1',`3`='2',`2`='3',`4`='4',.ordered = T)
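If dplyr is unavailable, the same relabeling can be sketched in base R with a named lookup vector (toy labels shown; the actual mapping would follow the aggregate() means above):

```r
# toy factor whose labels are out of order
old <- factor(c('4', '1', '2', '3'))
# named vector: old label -> new label
map <- c(`4` = '1', `1` = '2', `2` = '3', `3` = '4')
new <- factor(map[as.character(old)],
              levels = c('1', '2', '3', '4'),
              ordered = TRUE)
new   # 1 < 2 < 3 < 4, now in ascending order
```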
The hierarchical clustering process returns the silhouette information for each observation, a measure of how well a case has been classified. Silhouettes vary from -1 to +1: the higher the positive value, the better classified a case is. Low positive values indicate poor clustering; negative values indicate wrongly clustered cases.
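The silhouette of a single case can be computed by hand (toy numbers, not the session data): with a = mean distance to the case's own cluster and b = mean distance to the nearest other cluster, s = (b - a) / max(a, b).

```r
# three toy cases on a line; cluster A = {0, 1}, cluster B = {10}
x <- c(0, 1, 10)
a <- abs(x[1] - x[2])        # mean distance of case 1 to its own cluster: 1
b <- abs(x[1] - x[3])        # mean distance of case 1 to cluster B: 10
s <- (b - a) / max(a, b)
s                            # 0.9: close to +1, case 1 is well placed
```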
4.1 Plot silhouettes
fviz_silhouette(res.agnes)
## cluster size ave.sil.width
## 1 1 41 0.21
## 2 2 51 0.18
## 3 3 33 0.51
## 4 4 16 0.35
library(factoextra)
fviz_silhouette(res.diana)
## cluster size ave.sil.width
## 1 1 46 0.24
## 2 2 23 0.23
## 3 3 44 0.13
## 4 4 28 0.31
4.2 Detecting cases wrongly clustered
The previous results have saved important information. Let me keep the cases with negative silhouette values:
agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)
agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])
Now, I can see which countries are not well clustered:
library("qpcR")
bad_Clus=as.data.frame(qpcR:::cbind.na(sort(agnPoor),
sort(diaPoor)))
names(bad_Clus)=c("agn","dia")
bad_Clus
## agn dia
## 1 Algeria Czechia
## 2 Bangladesh Ghana
## 3 Benin Guyana
## 4 Chile Israel
## 5 Egypt Liberia
## 6 Gabon Namibia
## 7 Jordan North Macedonia
## 8 Kyrgyzstan Qatar
## 9 Liberia Serbia
## 10 Morocco Sierra Leone
## 11 Sierra Leone Spain
## 12 Sri Lanka Zambia
## 13 Turkey <NA>
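If the qpcR package is not installed, the NA-padded binding of unequal-length vectors can be sketched in base R (toy vectors shown, not the actual country lists):

```r
# pad the shorter vector with NA so both fit in one data frame
padNA <- function(v, n) c(v, rep(NA, n - length(v)))
a <- c("x", "y", "z")            # hypothetical 'agn' poor cases
b <- c("p", "q")                 # hypothetical 'dia' poor cases
n <- max(length(a), length(b))
data.frame(agn = padNA(sort(a), n),
           dia = padNA(sort(b), n))
```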
Color the maps by cluster membership:
base= ggplot(data=fromPy,
aes(x=dim1, y=dim2,
label=Countryname))
- For Hierarchical AGNES:
agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
aes(color=agn),
show.legend = T)
- For Hierarchical DIANA:
diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
aes(color=dia),
show.legend = T)
library(ggpubr)
ggarrange(agnPlot, diaPlot,ncol = 2,common.legend = T)
Annotating the poorly clustered cases:
# if the country is in the poorly clustered list, keep its name; otherwise use an empty label
LABELdia=ifelse(fromPy$Countryname%in%diaPoor,fromPy$Countryname,"")
LABELagn=ifelse(fromPy$Countryname%in%agnPoor,fromPy$Countryname,"")
- Prepare plot with the outlying labels:
library(ggrepel)
diaPlot=diaPlot +
geom_text_repel(aes(label=LABELdia),
max.overlaps=50,
min.segment.length =unit(0,'lines'))
agnPlot= agnPlot +
geom_text_repel(aes(label=LABELagn),
max.overlaps = 50,
min.segment.length = unit(0, 'lines'))
- Plot and compare:
ggarrange(agnPlot,
diaPlot,
ncol = 2,
common.legend = T)
fviz_dend(res.agnes,
k=NumberOfClusterDesired,
cex = 0.45,
horiz = T,
main = "AGNES approach")
fviz_dend(res.diana,
k=NumberOfClusterDesired,
cex = 0.45,
horiz = T,
main = "DIANA approach")
Let’s compare these clusters with the levels proposed by The Economist:
table(fromPy$Regimetype,fromPy$agn)
##
## 1 2 3 4
## Authoritarian 16 31 0 0
## Hybrid regime 0 9 20 0
## Flawed democracy 0 1 31 15
## Full democracy 0 0 0 18
table(fromPy$Regimetype,fromPy$dia)
##
## 1 2 3 4
## Authoritarian 38 8 0 1
## Hybrid regime 8 21 0 0
## Flawed democracy 0 15 22 10
## Full democracy 0 0 1 17
This way, you can use the clustering results to validate other classifications done theoretically or by simpler techniques (e.g., averaging).
Simply speaking, factor analysis (FA) tries to express in one (or a few) dimension(s) the behavior of several others. FA assumes that the input variables have ‘something’ in common: something latent that the set of input variables represents.
Let’s follow these steps:
Our dataForFA has the same data as the one we used for clustering.
dataForFA=dataToCluster
We know that we have two sets of variables, one related to democracy and the other to fragility, all of them computed at the country level.
In factor analysis we need a measure of similarity among variables (not cases). Let me compute the heterogeneous correlation matrix (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables).
library(polycor)
corMatrix=polycor::hetcor(dataForFA)$correlations
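When every column is numeric, as in our subset, the heterogeneous correlation matrix reduces to plain Pearson correlations, which base R computes with cor() (toy data shown):

```r
# toy numeric data: b is a multiple of a, c moves opposite to both
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(2, 4, 6, 8),
                  c = c(4, 3, 2, 1))
round(cor(toy), 2)   # a and b correlate +1; c correlates -1 with both
```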
We can visualize this matrix:
library(ggcorrplot)
ggcorrplot(corMatrix,
type = "lower") +
theme(axis.text.x = element_text(size = 5),
axis.text.y = element_text(size = 5))
You should notice that the variables belonging to the same concept are correlated with one another, while variables from different concepts should show low correlations.
3.1 The amount of data should be enough for the correlation process (sampling adequacy):
library(psych)
psych::KMO(corMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA = 0.93
## MSA for each item =
## Electoralprocessandpluralism Functioningofgovernment
## 0.90 0.95
## Politicalparticipation Politicalculture
## 0.93 0.95
## Civilliberties X1ExternalIntervention
## 0.91 0.93
## C1SecurityApparatus P2PublicServices
## 0.95 0.92
## E3HumanFlightandBrainDrain E2EconomicInequality
## 0.95 0.94
## S2RefugeesandIDPs C3GroupGrievance
## 0.94 0.93
## P1StateLegitimacy C2FactionalizedElites
## 0.93 0.94
## S1DemographicPressures E1Economy
## 0.93 0.96
## P3HumanRights
## 0.93
3.2 The correlation matrix should not be an identity matrix:
# is identity matrix?
cortest.bartlett(corMatrix,n=nrow(dataForFA))$p.value>0.05
## [1] FALSE
3.3 The correlation matrix should not be singular:
library(matrixcalc)
is.singular.matrix(corMatrix)
## [1] TRUE
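What singularity means can be sketched in base R: a matrix is singular when some rows or columns are linear combinations of others, so its rank falls below its dimension and its determinant is zero (toy matrix shown):

```r
# toy 2x2 matrix whose second row is twice the first
m <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)
qr(m)$rank   # 1 < 2: rank-deficient, hence singular
det(m)       # 0: no inverse exists
```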
If some conditions fail, you should not expect a reliable result; however, you may continue in order to locate the sources of the flaws. Here, the correlation matrix is singular, so read the following results with that caveat in mind.
fa.parallel(dataForFA, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
library(GPArotation)
resfa <- fa(dataForFA,
nfactors = 2,
cor = 'mixed',
rotate = "varimax",
fm="minres")
### see results
print(resfa$loadings)
##
## Loadings:
## MR1 MR2
## Electoralprocessandpluralism -0.225 -0.872
## Functioningofgovernment -0.516 -0.733
## Politicalparticipation -0.303 -0.753
## Politicalculture -0.434 -0.557
## Civilliberties -0.282 -0.933
## X1ExternalIntervention 0.679 0.454
## C1SecurityApparatus 0.731 0.467
## P2PublicServices 0.893 0.330
## E3HumanFlightandBrainDrain 0.769 0.265
## E2EconomicInequality 0.796 0.378
## S2RefugeesandIDPs 0.671 0.362
## C3GroupGrievance 0.409 0.485
## P1StateLegitimacy 0.471 0.825
## C2FactionalizedElites 0.561 0.672
## S1DemographicPressures 0.831 0.322
## E1Economy 0.800 0.290
## P3HumanRights 0.478 0.787
##
## MR1 MR2
## SS loadings 6.405 6.099
## Proportion Var 0.377 0.359
## Cumulative Var 0.377 0.736
You can see the structure better using a cutoff:
print(resfa$loadings,cutoff = 0.5)
##
## Loadings:
## MR1 MR2
## Electoralprocessandpluralism -0.872
## Functioningofgovernment -0.516 -0.733
## Politicalparticipation -0.753
## Politicalculture -0.557
## Civilliberties -0.933
## X1ExternalIntervention 0.679
## C1SecurityApparatus 0.731
## P2PublicServices 0.893
## E3HumanFlightandBrainDrain 0.769
## E2EconomicInequality 0.796
## S2RefugeesandIDPs 0.671
## C3GroupGrievance
## P1StateLegitimacy 0.825
## C2FactionalizedElites 0.561 0.672
## S1DemographicPressures 0.831
## E1Economy 0.800
## P3HumanRights 0.787
##
## MR1 MR2
## SS loadings 6.405 6.099
## Proportion Var 0.377 0.359
## Cumulative Var 0.377 0.736
The previous results indicate whether the variables group in a meaningful way. In our example, you want to know whether the indicators of each concept are grouped together. These results can also be visualized:
fa.diagram(resfa,main = "EFA results")
8.1 Let’s remove some variables. These belong to ‘fragility’ but may be close to ‘democracy’:
ps=c("P1StateLegitimacy","P2PublicServices","P3HumanRights")
notPs=setdiff(names(dataForFA),ps)
dataForFA2=dataForFA[,notPs]
8.2 Recompute correlations
library(polycor)
corMatrix2=polycor::hetcor(dataForFA2)$correlations
8.3 Recheck conditions:
library(psych)
psych::KMO(corMatrix2)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix2)
## Overall MSA = 0.92
## MSA for each item =
## Electoralprocessandpluralism Functioningofgovernment
## 0.86 0.96
## Politicalparticipation Politicalculture
## 0.93 0.94
## Civilliberties X1ExternalIntervention
## 0.88 0.93
## C1SecurityApparatus E3HumanFlightandBrainDrain
## 0.96 0.94
## E2EconomicInequality S2RefugeesandIDPs
## 0.90 0.92
## C3GroupGrievance C2FactionalizedElites
## 0.90 0.95
## S1DemographicPressures E1Economy
## 0.88 0.94
cortest.bartlett(corMatrix2,n=nrow(dataForFA2))$p.value>0.05
## [1] FALSE
library(matrixcalc)
is.singular.matrix(corMatrix2)
## [1] FALSE
8.4 Get suggestions
fa.parallel(dataForFA2, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
8.5 Compute factors (here requesting three, one more than suggested):
library(GPArotation)
resfa <- fa(dataForFA2,
nfactors = 3,
cor = 'mixed',
rotate = "varimax",
fm="minres")
8.6 Explore results
Visually:
fa.diagram(resfa,main = "EFA results (2)")
If you find no theory to explain this result, you can try a smaller number of factors (try 2):
library(GPArotation)
resfa <- fa(dataForFA2,
nfactors = 2,
cor = 'mixed',
rotate = "varimax",
fm="minres")
fa.diagram(resfa,main = "EFA results (3)")
8.7 Analyse new results
sort(resfa$communality)
## C3GroupGrievance Politicalculture
## 0.3999367 0.4983043
## S2RefugeesandIDPs Politicalparticipation
## 0.6022094 0.6673252
## E3HumanFlightandBrainDrain X1ExternalIntervention
## 0.6675217 0.6942566
## E2EconomicInequality E1Economy
## 0.7238521 0.7296954
## S1DemographicPressures C1SecurityApparatus
## 0.7409543 0.7523090
## C2FactionalizedElites Functioningofgovernment
## 0.7566783 0.8130100
## Electoralprocessandpluralism Civilliberties
## 0.8543870 0.9745699
sort(resfa$complexity)
## Electoralprocessandpluralism E3HumanFlightandBrainDrain
## 1.141434 1.177320
## E1Economy Civilliberties
## 1.196322 1.221408
## S1DemographicPressures Politicalparticipation
## 1.292111 1.353628
## S2RefugeesandIDPs E2EconomicInequality
## 1.424945 1.442187
## C1SecurityApparatus X1ExternalIntervention
## 1.540870 1.589541
## Functioningofgovernment Politicalculture
## 1.869219 1.987251
## C3GroupGrievance C2FactionalizedElites
## 1.990284 1.997956
It is not that we have the perfect solution, but you need to stop at some point.
Each factor is represented by a score:
summary(resfa$scores)
## MR1 MR2
## Min. :-2.2798 Min. :-2.3848
## 1st Qu.:-0.7765 1st Qu.:-0.8269
## Median : 0.1364 Median : 0.2563
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7811 3rd Qu.: 0.8296
## Max. : 1.7894 Max. : 1.4822
As you can see, the scores are not in the range from zero to ten; let’s rescale them:
library(BBmisc)
efa_scores=normalize(resfa$scores,
method = "range",
margin=2, # by column
range = c(0, 10))
# save the scores in the data
fromPy$Fragile_efa=efa_scores[,1]
fromPy$Demo_efa=efa_scores[,2]
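The range rescaling performed by normalize() can be sketched per column in base R (rescale0to10 is a hypothetical helper name):

```r
# map a vector linearly so its minimum becomes 0 and its maximum 10
rescale0to10 <- function(v) 10 * (v - min(v)) / (max(v) - min(v))
rescale0to10(c(-2, 0, 2))   # 0 5 10
```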
You do not normally have a way to validate the scores, but our example data has precomputed indices. Let’s use those to see if our scores make sense.
library("ggpubr")
ggscatter(data=fromPy, x = "Overallscore", y = "Demo_efa",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "DemoIndex (original)", ylab = "DemoIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'
ggscatter(data=fromPy, x = "Total", y = "Fragile_efa",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "FragileIndex (original)", ylab = "FragileIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'
Let’s save the data frame fromPy for further use:
saveRDS(fromPy,file = 'fromPyPlus.RDS')